Colab FAQ

For a basic overview of the features offered in Colab notebooks, check out: Overview of Colaboratory Features

You need to use the Colab GPU for this assignment by selecting:

Runtime   →   Change runtime type   →   Hardware Accelerator: GPU

Part 4: Connecting Text and Images with CLIP

Acknowledgement: This notebook is based on the code from https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb. Credit to OpenAI.

Section I: Interacting with CLIP

This is a self-contained notebook that shows how to download and run CLIP models, calculate the similarity between arbitrary image and text inputs, and perform zero-shot image classifications. The next cells will install the clip package and its dependencies, and check if PyTorch 1.7.1 or later is installed.

Loading the model

clip.available_models() will list the names of available CLIP models.
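As a sketch, loading a model might look like the following. This assumes the `clip` package from OpenAI's repository is installed and downloads model weights on first use, so it is shown for illustration only; the attribute names follow the original notebook.

```python
import torch
import clip

# List the names of the available CLIP models.
print(clip.available_models())  # includes e.g. 'RN50' and 'ViT-B/32'

# Load a model and its matching preprocessing transform.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Basic properties of the loaded model.
input_resolution = model.visual.input_resolution  # 224 for ViT-B/32
context_length = model.context_length             # 77
vocab_size = model.vocab_size                     # 49408
```

`clip.load()` returns both the model and the preprocessing transform, so the two always stay in sync.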

Image Preprocessing

We resize the input images and center-crop them to conform to the image resolution that the model expects. We then normalize the pixel intensities using the dataset's mean and standard deviation.

The second return value from clip.load() contains a torchvision Transform that performs this preprocessing.

Text Preprocessing

We use a case-insensitive tokenizer, which can be invoked using clip.tokenize(). By default, the outputs are padded to 77 tokens, which is what the CLIP model expects.
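A toy illustration of that interface (not CLIP's real byte-pair-encoding tokenizer): lowercase the text, wrap it in start/end markers, and pad every sequence to a fixed context length of 77 tokens. The token ids here are made up for illustration.

```python
CONTEXT_LENGTH = 77
SOT, EOT, PAD = 1, 2, 0  # hypothetical start-of-text, end-of-text, pad ids

def toy_tokenize(texts):
    batch = []
    for text in texts:
        # Stand-in for byte-pair encoding: one id per whitespace-split word.
        vocab = {}
        ids = [vocab.setdefault(w, len(vocab) + 3) for w in text.lower().split()]
        # Truncate to fit, add markers, then pad to the full context length.
        ids = [SOT] + ids[: CONTEXT_LENGTH - 2] + [EOT]
        batch.append(ids + [PAD] * (CONTEXT_LENGTH - len(ids)))
    return batch

tokens = toy_tokenize(["A photo of a cat", "a PHOTO of a CAT"])
print(len(tokens[0]))          # 77
print(tokens[0] == tokens[1])  # True: tokenization is case-insensitive
```

The real `clip.tokenize()` returns a LongTensor of shape [number of texts, 77] with the same padding behavior.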

Setting up input images and texts

We are going to feed 8 example images and their textual descriptions to the model, and compare the similarity between the corresponding features.

The tokenizer is case-insensitive, and we can freely give any suitable textual descriptions.

Building features

We normalize the images, tokenize each text input, and run the forward pass of the model to get the image and text features.

Calculating cosine similarity

We normalize the features and calculate the dot product of each pair.
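The computation above can be sketched in NumPy: L2-normalize each feature vector, then take pairwise dot products, which for unit vectors equal cosine similarities. The 4-dimensional features here are made-up toy values; real ViT-B/32 CLIP features are 512-dimensional.

```python
import numpy as np

image_features = np.array([[1.0, 2.0, 0.0, 1.0],
                           [0.0, 1.0, 3.0, 1.0]])
text_features = np.array([[2.0, 4.0, 0.0, 2.0],
                          [1.0, 0.0, 0.0, 0.0]])

# L2-normalize each row so dot products become cosine similarities.
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# similarity[i, j] = cosine similarity between text i and image j.
similarity = text_features @ image_features.T
print(similarity.round(3))
```

Here text 0 is a scalar multiple of image 0, so their cosine similarity is exactly 1, while text 1 is orthogonal to image 1, giving 0.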

Zero-Shot Image Classification

You can classify images using the cosine similarity (times 100) as the logits to the softmax operation.
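A minimal sketch of that step, with toy cosine similarities standing in for real CLIP outputs: scale by 100 and apply a softmax over the candidate class descriptions.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Cosine similarities of one image against three candidate class captions.
cosine_sims = np.array([0.31, 0.22, 0.25])

# Treat 100 * cosine similarity as the logits of a classifier.
probs = softmax(100.0 * cosine_sims)
print(probs.argmax())  # 0: the first caption wins
```

The factor of 100 corresponds to CLIP's learned logit scale; without it the softmax over small cosine values would be nearly uniform.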

Section II: Now let's do a Scavenger Hunt!

We want you to figure out what caption best describes the image below. We will run your caption against images in ImageNet and display the image with the highest network probability. The goal is for your caption, paired with the image below, to give the highest network output.

imagenet_test_img.png

We will download a subset of ImageNet called Tiny ImageNet. Tiny ImageNet has only 200 classes, with each class having 500 training images, 50 validation images, and 50 test images.

In order to reduce time and memory consumption, we will only consider the first 1000 images in the test set as the possible search space.

Now, we will run the model on the first 1000 images in the Tiny ImageNet test set and display the image that produces the highest network probability with your written caption.
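The search itself reduces to an argmax over similarity scores. Here is a sketch with random stand-ins for the 1000 test-image embeddings and the caption embedding; in the notebook these would come from the model's encode_image and encode_text outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
image_features = rng.normal(size=(1000, 512))  # stand-in image embeddings
caption_feature = rng.normal(size=(512,))      # stand-in caption embedding

# Normalize, score every image against the caption, and keep the best one.
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
caption_feature /= np.linalg.norm(caption_feature)
scores = image_features @ caption_feature
best = int(scores.argmax())
print(best, scores[best])  # index and similarity of the winning image
```

Displaying `images[best]` then shows which test image the caption retrieves.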

Part 4 Question 2 - Prompting CLIP:

Finding the caption for this image was easy; I succeeded on my second try. My initial approach was to name the two obvious objects in the image, the butterfly and the flower. I then narrowed the caption down using a second attribute, the colour of the flower, and succeeded in retrieving this image.